Individual Assingment: Default Credit Cards Clients¶

A bank in you region want to build a model to predict credit card defaulters more accurately in order minimize money loss. For this task we have a set of data on default payments and demographic data to help us do our task. Data is comprised in the following CSV files:

TRAINING

train_customers.csv

  • ID: ID of each client
  • LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit
  • SEX: Gender (1=male, 2=female)
  • EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  • MARRIAGE: Marital status (1=married, 2=single, 3=others)
  • AGE: Age in years

train_series.csv

  • ID: ID of each client
  • MONTH: The month to wich data is refering
  • PAY: Repayment status in the corresponding month (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
  • BILL_AMT: Amount of bill statement in the corresponding month (NT dollar)
  • PAY_AMT: Amount of previous payment in the corresponding month (NT dollar)

train_target.csv

  • DEFAULT_JULY: Default payment in July (1=yes, 0=no)

TEST (For PART 3)

  • test_data.csv

SUBMISSION (For PART 3)

  • submission_features.csv

BACKUP

  • train_data.csv

Load all data and answer the following questions

In [1]:
import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
import seaborn as sn
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
In [2]:
train_customers = pd.read_csv('train_customers.csv')
train_series = pd.read_csv('train_series.csv')
train_target = pd.read_csv('train_target.csv')
test_data = pd.read_csv('test_data.csv')
submission_features = pd.read_csv('submission_features.csv')
train_data = pd.read_csv('train_data.csv')

A. DATA ENGINEERING (3 points)¶

Question 1¶

(2 points)

In [3]:
train_series.head()
Out[3]:
ID MONTH PAY BILL_AMT PAY_AMT
0 9910 JUNE 0.0 76885.0 3500.0
1 9910 MAY 0.0 79106.0 4000.0
2 9910 APRIL 0.0 81231.0 3000.0
3 9910 MARCH 0.0 81983.0 3146.0
4 9910 FEBRUARY 0.0 83773.0 3260.0
In [4]:
ts_pivot = train_series.pivot(index='ID', columns='MONTH', values=['PAY', 'BILL_AMT', 'PAY_AMT'])
ts_pivot.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in ts_pivot.columns]
ts_pivot.head()
Out[4]:
PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY
ID
0 0.0 -1.0 0.0 -1.0 0.0 0.0 13600.0 44024.0 18697.0 28991.0 0.0 24391.0 0.0 680.0 10000.0 1300.0 22373.0 1000.0
1 0.0 0.0 0.0 0.0 0.0 0.0 53169.0 50372.0 49470.0 75662.0 50875.0 70073.0 2000.0 1903.0 2006.0 3212.0 1603.0 2106.0
2 2.0 2.0 2.0 2.0 2.0 2.0 161487.0 168094.0 170922.0 155910.0 157577.0 158819.0 0.0 5500.0 1000.0 6800.0 13000.0 6500.0
3 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 0.0 0.0 0.0 23570.0 0.0 735.0 0.0 0.0 0.0 742.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 9044.0 9417.0 9617.0 8214.0 9225.0 8034.0 331.0 356.0 330.0 1140.0 341.0 1150.0

Question 2¶

(0.5 points)

In [5]:
train_customers.head()
Out[5]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE
0 9910 130000.0 2 2 1 27
1 15561 60000.0 2 3 1 48
2 23607 330000.0 2 1 2 44
3 6314 60000.0 2 2 2 24
4 27534 180000.0 2 1 2 33
In [6]:
tc_merged = pd.merge(train_customers, ts_pivot, on='ID', how='left')
tc_merged.head()
Out[6]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0

Question 3¶

(0.5 points)

In [7]:
train_target.head()
Out[7]:
ID DEFAULT_JULY
0 9910 0
1 15561 1
2 23607 1
3 6314 0
4 27534 0
In [8]:
merged_data = pd.merge(tc_merged, train_target, on='ID', how='left')
merged_data.head()
Out[8]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0
In [9]:
# This creates the csv used in submission
#merged_data.to_csv('Submissions/FordArturo_A_train.csv', index=False)

B. EXPLORATORY DATA ANALYSIS (2 points)¶

Question 1¶

(0.5 points)

Per the data, it seems that most individuals who tend to default also achieve a higher education, something that isn't necessarily intuitive at first glance. However, conclusions cannot be drawn only based on this histogram as it doesn't show the proportion of those who don't default for a proper comparison. It could well just be the case that more individuals with higher education are credit card users overall, thereby inflating the amounts on the histogram regardless of their proportion to those who don't default.

In [10]:
defaulters = merged_data[merged_data.DEFAULT_JULY == 1]['EDUCATION']
defaulters_hist = px.histogram(defaulters, x='EDUCATION')
defaulters_hist

Question 2¶

(0.5 points)

The data indicates that those who default in July tend to have a less Balance Limit than their non-default counterparts. However, this isn't a definite indicator that a given person might default, as evidenced in the correlation matrix below. While it is a pattern that might be logical, considering lenders tend to be cautious about giving out higher limits, it seems that the trend is not strong enough to be an accurate predictor on its own.

In [11]:
lim_bal_defaults = merged_data[['LIMIT_BAL', 'DEFAULT_JULY']]
lim_bal_hist = px.box(lim_bal_defaults, x='DEFAULT_JULY', y='LIMIT_BAL')
lim_bal_hist

Question 3¶

(1 point)

In [12]:
full_df = merged_data.copy()
bins = [17, 25, 35, 45, 55, 65, 1000]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']
full_df['AGE_BIN'] = pd.cut(full_df['AGE'], bins=bins, labels=labels)
full_df.head()
Out[12]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY AGE_BIN
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0 26-35
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1 46-55
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1 36-45
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0 18-25
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0 26-35

Insight 1: Age of defaulters¶

In this pie chart we explore the age ranges of those individuals who default in July. The data points out that the vast majority of those who default (68.7% combined) are in the age groups of 26-35 and 36-45. Given the large number, it could indicate a pattern between age and difficulty making payments. More insights can be provided if we look at ratios of the total amount of customers in those categories.

In [13]:
#Insight 1: Age of Defaulters
sort_age = full_df.sort_values(by=['AGE']).copy()
chart1 = px.pie(sort_age, values='AGE', names='AGE_BIN', title='Defaulters by age group')
chart1.show()

Insight 2: Marriage of Defaulters¶

By using this Tree Map, we can see how marriage status can affect the likelihood of a person defaulting. By looking at the resulting amounts within the boxes, we can see that the marriage status is almost evenly split in both the 'Not defaulted' and the 'Defaulted' groups. There is a little more single non-defaulters than married non-defaulters, but it is not very significant. It seems that this is not a good indicator of likelihood of defaulting.

In [14]:
#Insight 2: Marriage of Defaulters
full_df['DEF_LABEL'] = ['Defaulted' if i == 1 else 'Not defaulted' for i in full_df['DEFAULT_JULY']]
full_df['MARRIAGE_LABEL'] = ['Married' if i == 1 else 'Single' if i == 2 else 'Other' for i in full_df['MARRIAGE']]
chart2 = px.treemap(full_df, path=[px.Constant("Defaults in July"), 'DEF_LABEL', 'MARRIAGE_LABEL'])
chart2.show()

Insight 3: Payment Status per Education group¶

Data shows that there is a much larger share of those who pay duly or early if they have achieved a higher education, in comparison to those in the late pay categories. In those, we see roughly the same amounts for each group, so this could mean that those who pay late do so regardless of their educational backgrounds. In contrast, those who do pay in time tend to have achieved a higher education.

In [15]:
#Insight 3: Payment status per Education group
filter_educ = full_df[full_df['EDUCATION'].isin([1,2,3,4])].copy()
filter_educ['EDU_LABEL'] = ['Graduate' if i == 1 else 'University' if i == 2 else 'High School' if i == 3 else 'Other' for i in filter_educ['EDUCATION']]
filter_educ['PAY_JUNE_LABEL'] = ['Early Pay' if i < 0 else 'Pay Duly' if i == 0 else 'Late 1 month' if i == 1 else 'Late 2 months' if i == 2 else 'Late >=3 months' for i in filter_educ['PAY_JUNE']]
filter_educ = filter_educ.sort_values(by=['PAY_JUNE'])
chart3 = px.histogram(filter_educ, x='PAY_JUNE_LABEL', color='EDU_LABEL')
chart3

C. MACHINE LEARNING (5 points)¶

In this part, we're going to build a machine learning model to estimate the probability of default of the customers. For this part, you will need two additional datasets:

test_data.csv - Contains new data from 9000 customers with the same structure as the training data from Part 1. It contains all the features and the target already joined. You can use this data set for testing purposes (or you can build your own test set from the training set). Anyway, keep it because you will need it for question C.2 submission_features.csv - This contains data associated with unseen customers, but without the target.

SOME CONSIDERATIONS Try several models to select the best suited for this problem Don't forget about hyperparameter tuning to select the best configuration for the model Don't forget to check the different metrics in each CV round to build a robust model

In [16]:
to_train = merged_data.copy()
to_train.head()
Out[16]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0
In [17]:
# This function creates 1 or 0 values if the person in question has ever missed a payment.
def has_missed_pay(to_train):
    if ((to_train['PAY_JUNE'] > 0) or
        (to_train['PAY_MAY'] > 0) or
        (to_train['PAY_APRIL'] > 0) or
        (to_train['PAY_MARCH'] > 0) or
        (to_train['PAY_FEBRUARY'] > 0) or
        (to_train['PAY_JANUARY'] > 0)):
        return 1
    else:
        return 0
    
to_train['HAS_MISSED_PAY_BEFORE'] = to_train.apply(has_missed_pay, axis=1)
to_train.head()
Out[17]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY HAS_MISSED_PAY_BEFORE
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0 0
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1 0
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1 1
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0 0
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0 0
In [18]:
# This function counts the amount of times a person has defaulted before.
def times_missed_pay(to_train):
    times = 0
    for i in (to_train['PAY_JUNE'], to_train['PAY_MAY'], to_train['PAY_APRIL'], 
              to_train['PAY_MARCH'], to_train['PAY_FEBRUARY'], to_train['PAY_JANUARY']):
        if i > 0:
            times += 1
    return times

to_train['TIMES_MISSED_PAY'] = to_train.apply(times_missed_pay, axis=1)

to_train.head()
Out[18]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY HAS_MISSED_PAY_BEFORE TIMES_MISSED_PAY
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0 0 0
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1 0 0
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1 1 1
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0 0 0
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0 0 0
In [19]:
# This function calculates the average amount of debt paid by the person in question every month.
# Since this requires both the month of debt and the month of payment, it is programmed in a way
# that will only include the payment percentage if there was actual debt in a given month to avoid
# divisions by 0. Also, it doesn't check for June's debt since July isn't paid for yet.

# This function runs, but gets removed from the DataFrame as a result of low score in the correlation matrix.

def perc_paid_on_avg(to_train):
    months = {1:'JANUARY', 2:'FEBRUARY', 3:'MARCH', 4:'APRIL', 5:'MAY', 6:'JUNE'}
    ratios = []
    for i, j in months.items():
        if i == 6:
            break
        if (to_train[f'BILL_AMT_{j}'] == 0):
            continue
        debt_paid_ratio = 1 - ((to_train[f'BILL_AMT_{j}'] - to_train[f'PAY_AMT_{months[i+1]}']) / to_train[f'BILL_AMT_{j}'])
        ratios.append(debt_paid_ratio)
    if len(ratios) == 0:
        return -1
    result = sum(ratios)/len(ratios)
    return result
to_train['PERC_PAID_ON_AVG'] = to_train.apply(perc_paid_on_avg, axis=1)
to_train
Out[19]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY HAS_MISSED_PAY_BEFORE TIMES_MISSED_PAY PERC_PAID_ON_AVG
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0 0 0 0.041150
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1 0 0 1.031086
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1 1 1 0.038339
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0 0 0 -0.041747
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0 0 0 1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 28636 330000.0 2 1 2 33 -2.0 2.0 2.0 1.0 -1.0 -2.0 -200.0 150.0 350.0 -200.0 150.0 -200.0 350.0 200.0 0.0 0.0 0.0 0.0 0 1 3 0.580952
19996 17730 50000.0 1 2 2 49 0.0 0.0 0.0 0.0 0.0 0.0 48551.0 18077.0 18815.0 59236.0 34656.0 53689.0 2047.0 1000.0 2000.0 2000.0 5618.0 2124.0 1 0 0 0.100799
19997 28030 410000.0 1 1 2 32 -1.0 -1.0 -1.0 -1.0 -1.0 0.0 7764.0 4695.0 4133.0 3339.0 623.0 13047.0 623.0 5973.0 4421.0 10000.0 4695.0 7764.0 0 0 0 1.042331
19998 15725 200000.0 1 1 1 40 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 5074.0 5606.0 2534.0 -7.0 19268.0 1765.0 19364.0 2547.0 33300.0 1772.0 5634.0 5099.0 0 0 0 1.004800
19999 19966 60000.0 1 2 2 25 2.0 0.0 0.0 2.0 0.0 2.0 15444.0 17598.0 17472.0 36090.0 13695.0 22083.0 1000.0 1000.0 603.0 2017.0 5000.0 5.0 1 1 3 0.101208

20000 rows × 28 columns

In [20]:
# This function returns the amount of months that have passed since the last missed payment.

def mo_since_last_miss(to_train):
    months = {'JUNE':1, 'MAY':2, 'APRIL':3, 'MARCH':4, 'FEBRUARY':5, 'JANUARY':6}
    smallest = 7
    for i, j in months.items():
        if (to_train[f'PAY_{i}'] > 0) and (j<smallest):
            smallest = j
    if smallest == 7:
        smallest = 0
    return smallest

to_train['MO_SINCE_LAST_MISS'] = to_train.apply(mo_since_last_miss, axis=1)
to_train
Out[20]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_APRIL PAY_FEBRUARY PAY_JANUARY PAY_JUNE PAY_MARCH PAY_MAY BILL_AMT_APRIL BILL_AMT_FEBRUARY BILL_AMT_JANUARY BILL_AMT_JUNE BILL_AMT_MARCH BILL_AMT_MAY PAY_AMT_APRIL PAY_AMT_FEBRUARY PAY_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MARCH PAY_AMT_MAY DEFAULT_JULY HAS_MISSED_PAY_BEFORE TIMES_MISSED_PAY PERC_PAID_ON_AVG MO_SINCE_LAST_MISS
0 9910 130000.0 2 2 1 27 0.0 0.0 0.0 0.0 0.0 0.0 81231.0 83773.0 85532.0 76885.0 81983.0 79106.0 3000.0 3260.0 3200.0 3500.0 3146.0 4000.0 0 0 0 0.041150 0
1 15561 60000.0 2 3 1 48 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 8422.0 3910.0 2431.0 4823.0 4377.0 5471.0 4377.0 2431.0 2120.0 5491.0 3918.0 9683.0 1 0 0 1.031086 0
2 23607 330000.0 2 1 2 44 2.0 0.0 0.0 0.0 0.0 0.0 253863.0 262753.0 268145.0 243621.0 257535.0 259517.0 9400.0 9766.0 9786.0 21400.0 9542.0 0.0 1 1 1 0.038339 3
3 6314 60000.0 2 2 2 24 0.0 0.0 0.0 0.0 0.0 0.0 15069.0 5879.0 -2879.0 29832.0 15761.0 18628.0 1000.0 1329.0 41378.0 1538.0 118.0 1308.0 0 0 0 -0.041747 0
4 27534 180000.0 2 1 2 33 -1.0 -1.0 -1.0 -2.0 -1.0 -2.0 178.0 0.0 1118.0 4500.0 1500.0 2580.0 1500.0 1118.0 1331.0 2580.0 0.0 178.0 0 0 0 1.000000 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 28636 330000.0 2 1 2 33 -2.0 2.0 2.0 1.0 -1.0 -2.0 -200.0 150.0 350.0 -200.0 150.0 -200.0 350.0 200.0 0.0 0.0 0.0 0.0 0 1 3 0.580952 1
19996 17730 50000.0 1 2 2 49 0.0 0.0 0.0 0.0 0.0 0.0 48551.0 18077.0 18815.0 59236.0 34656.0 53689.0 2047.0 1000.0 2000.0 2000.0 5618.0 2124.0 1 0 0 0.100799 0
19997 28030 410000.0 1 1 2 32 -1.0 -1.0 -1.0 -1.0 -1.0 0.0 7764.0 4695.0 4133.0 3339.0 623.0 13047.0 623.0 5973.0 4421.0 10000.0 4695.0 7764.0 0 0 0 1.042331 0
19998 15725 200000.0 1 1 1 40 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 5074.0 5606.0 2534.0 -7.0 19268.0 1765.0 19364.0 2547.0 33300.0 1772.0 5634.0 5099.0 0 0 0 1.004800 0
19999 19966 60000.0 1 2 2 25 2.0 0.0 0.0 2.0 0.0 2.0 15444.0 17598.0 17472.0 36090.0 13695.0 22083.0 1000.0 1000.0 603.0 2017.0 5000.0 5.0 1 1 3 0.101208 1

20000 rows × 29 columns

In [21]:
train_x_corr = to_train.drop(['ID', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY',], axis=1) 
corr_matrix = train_x_corr.corr()
sn.heatmap(corr_matrix, annot=True)
plt.show()
In [22]:
# In here I removed all variables that had less than -.1 or .1 correlation with DEFAULT_JULY as well as
# all PAY_*, BILL_AMT_*, and PAY_AMT_* fields since the insights found on those were used to 
# obtain the calculated fields used instead.
train_x = to_train.drop(['DEFAULT_JULY', 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PERC_PAID_ON_AVG', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY'], axis=1) 
train_y = to_train['DEFAULT_JULY']
X_train, X_test, y_train, test_y = train_test_split(train_x, train_y, test_size=0.3, random_state=None)
train_x
Out[22]:
LIMIT_BAL HAS_MISSED_PAY_BEFORE TIMES_MISSED_PAY MO_SINCE_LAST_MISS
0 130000.0 0 0 0
1 60000.0 0 0 0
2 330000.0 1 1 3
3 60000.0 0 0 0
4 180000.0 0 0 0
... ... ... ... ...
19995 330000.0 1 3 1
19996 50000.0 0 0 0
19997 410000.0 0 0 0
19998 200000.0 0 0 0
19999 60000.0 1 3 1

20000 rows × 4 columns

In [23]:
# Random Forest was selected as the best algorithm to incorporate here, and runs with the parameters below.

parameters = dict(
    max_depth=range(3, 40, 2),
    min_samples_split=range(3, 40, 2),
    min_samples_leaf=range(1, 40, 2)
)

clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=parameters,
    n_iter=5,
    scoring="roc_auc_ovr_weighted",
    n_jobs=7,
    return_train_score=True,
    random_state = 73
)
In [24]:
clf.fit(X_train, y_train)
Out[24]:
RandomizedSearchCV(estimator=RandomForestClassifier(), n_iter=5, n_jobs=7,
                   param_distributions={'max_depth': range(3, 40, 2),
                                        'min_samples_leaf': range(1, 40, 2),
                                        'min_samples_split': range(3, 40, 2)},
                   random_state=73, return_train_score=True,
                   scoring='roc_auc_ovr_weighted')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(estimator=RandomForestClassifier(), n_iter=5, n_jobs=7,
                   param_distributions={'max_depth': range(3, 40, 2),
                                        'min_samples_leaf': range(1, 40, 2),
                                        'min_samples_split': range(3, 40, 2)},
                   random_state=73, return_train_score=True,
                   scoring='roc_auc_ovr_weighted')
RandomForestClassifier()
RandomForestClassifier()
In [25]:
print(clf.best_params_)
print(clf.best_score_)
{'min_samples_split': 29, 'min_samples_leaf': 15, 'max_depth': 3}
0.7459157990036696
In [26]:
results = pd.DataFrame(clf.cv_results_)
In [27]:
# Analyze results
test_results = pd.DataFrame(results).filter(regex="split\d_test|mean_test|std_test|param_")
test_results["CoV"] = test_results["std_test_score"]/test_results["mean_test_score"]*100
test_results.sort_values("mean_test_score", ascending=False)
Out[27]:
param_min_samples_split param_min_samples_leaf param_max_depth split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score CoV
1 29 15 3 0.728920 0.742342 0.760511 0.739772 0.758035 0.745916 0.011826 1.585441
3 31 1 5 0.728218 0.743847 0.761446 0.737782 0.754851 0.745229 0.011852 1.590358
2 13 23 23 0.726265 0.742583 0.758756 0.728890 0.750758 0.741451 0.012456 1.679942
4 7 21 31 0.725918 0.741037 0.758529 0.728081 0.751571 0.741028 0.012754 1.721166
0 37 7 29 0.722333 0.738835 0.755537 0.722627 0.747986 0.737464 0.013329 1.807412
In [28]:
# Analyze results
train_results = pd.DataFrame(results).filter(regex="split\d_train|mean_train|std_train|param_")
train_results["CoV"] = train_results["std_train_score"]/train_results["mean_train_score"]*100
train_results.sort_values("mean_train_score", ascending=False)
Out[28]:
param_min_samples_split param_min_samples_leaf param_max_depth split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score CoV
0 37 7 29 0.768684 0.764073 0.759709 0.767434 0.762267 0.764434 0.003293 0.430757
4 7 21 31 0.764165 0.760130 0.755074 0.763027 0.757524 0.759984 0.003374 0.443955
2 13 23 23 0.763914 0.759239 0.754816 0.762380 0.757087 0.759487 0.003335 0.439071
3 31 1 5 0.759369 0.754846 0.749555 0.757673 0.751587 0.754606 0.003655 0.484300
1 29 15 3 0.752203 0.748540 0.743673 0.749820 0.744567 0.747761 0.003209 0.429153
In [29]:
# In here I added the calculated fields above to the testing set and removed those unused as well.

test_data['HAS_MISSED_PAY_BEFORE'] = test_data.apply(has_missed_pay, axis=1)
test_data['TIMES_MISSED_PAY'] = test_data.apply(times_missed_pay, axis=1)
test_data['PERC_PAID_ON_AVG'] = test_data.apply(perc_paid_on_avg, axis=1)
test_data['MO_SINCE_LAST_MISS'] = test_data.apply(mo_since_last_miss, axis=1)
test_x = test_data.drop(['DEFAULT_JULY', 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PERC_PAID_ON_AVG', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY'], axis=1) 
test_x = test_x[test_x.columns]
proba = clf.predict_proba(test_x)
pred = clf.predict(test_x)
In [30]:
proba = proba[:,1]
In [31]:
test_y = test_data['DEFAULT_JULY']
auc_score = roc_auc_score(test_y, proba, average=None)
avg_auc_score = roc_auc_score(test_y, proba, average="weighted")
In [32]:
print(f"AUC Class: {round(auc_score,3)}")
print(f"Avg. AUC (Macro Wieighted): {round(avg_auc_score,3)}")
AUC Class: 0.756
Avg. AUC (Macro Wieighted): 0.756
In [33]:
print(classification_report(test_y, pred))
              precision    recall  f1-score   support

           0       0.83      0.95      0.89      7005
           1       0.64      0.32      0.43      1995

    accuracy                           0.81      9000
   macro avg       0.74      0.63      0.66      9000
weighted avg       0.79      0.81      0.78      9000

In [34]:
fig = go.Figure()

for i in clf.classes_:
    y_true = (test_y==i).astype(int).values
    y_score = proba
    fpr, tpr, _ = roc_curve(y_true, y_score)

    name = f"Class {i} (AUC={auc_score:.3f})"
    fig.add_trace(go.Scatter(x=fpr, y=tpr, name=name, mode='lines', line=dict(width=3)))

fig.update_layout(
    title="ROC for each Class",
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    template="none",
    yaxis=dict(scaleanchor="x", scaleratio=0.5),
    width=1200, height=600
)

Question 1¶

(1.5 points)

In [35]:
sub = submission_features.copy()
sub = sub.set_index('ID')
sub
Out[35]:
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_JUNE PAY_MAY PAY_APRIL PAY_MARCH PAY_FEBRUARY PAY_JANUARY BILL_AMT_JUNE BILL_AMT_MAY BILL_AMT_APRIL BILL_AMT_MARCH BILL_AMT_FEBRUARY BILL_AMT_JANUARY PAY_AMT_JUNE PAY_AMT_MAY PAY_AMT_APRIL PAY_AMT_MARCH PAY_AMT_FEBRUARY PAY_AMT_JANUARY
ID
2774 70000.0 2 3 2 22 0 0 0 0 0 0 52899.0 45758.0 37858.0 29660.0 29519.0 29457.0 1726.0 3028.0 1026.0 1195.0 1060.0 1031.0
15339 100000.0 2 2 1 38 1 2 0 0 2 0 14483.0 13961.0 15323.0 16268.0 15868.0 16448.0 0.0 1600.0 1500.0 0.0 1000.0 1500.0
26485 140000.0 2 1 2 40 1 2 0 0 0 0 143478.0 135633.0 128285.0 123074.0 97382.0 99364.0 0.0 6601.0 5522.0 3200.0 3300.0 3500.0
6657 380000.0 2 2 2 27 0 0 -1 0 0 0 4400.0 0.0 270219.0 3641.0 3717.0 3791.0 0.0 270219.0 131.0 135.0 136.0 76.0
7917 100000.0 1 2 2 26 0 0 0 0 0 0 32186.0 32949.0 33680.0 34419.0 37782.0 76185.0 2000.0 2000.0 2000.0 4500.0 40000.0 3000.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2501 280000.0 1 1 2 28 0 0 0 0 0 0 139719.0 136052.0 118076.0 101457.0 81124.0 42347.0 6000.0 4000.0 4000.0 4000.0 3000.0 1000.0
676 180000.0 1 2 1 48 -1 -1 -1 -1 -1 -1 1294.0 1294.0 1466.0 1294.0 2324.0 264.0 1294.0 1466.0 1294.0 2324.0 264.0 264.0
26934 90000.0 2 2 1 32 1 -1 -1 -1 -1 -1 0.0 323.0 0.0 2520.0 1651.0 0.0 323.0 0.0 2520.0 1651.0 0.0 0.0
27956 60000.0 2 2 2 28 2 2 2 2 2 2 48322.0 51587.0 52664.0 53059.0 51468.0 54511.0 4692.0 2500.0 2000.0 0.0 4019.0 0.0
6298 20000.0 2 1 2 24 0 0 0 0 0 0 15036.0 16069.0 16774.0 17266.0 17539.0 18033.0 1580.0 1280.0 770.0 700.0 785.0 600.0

1000 rows × 23 columns

In [36]:
sub = submission_features.copy()
sub['HAS_MISSED_PAY_BEFORE'] = sub.apply(has_missed_pay, axis=1)
sub['TIMES_MISSED_PAY'] = sub.apply(times_missed_pay, axis=1)
sub['MO_SINCE_LAST_MISS'] = sub.apply(mo_since_last_miss, axis=1)
sub = sub.drop(['ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_JUNE', 'PAY_MAY', 'PAY_APRIL', 'PAY_MARCH', 'PAY_FEBRUARY', 'PAY_JANUARY', 'BILL_AMT_JUNE', 'BILL_AMT_MAY', 'BILL_AMT_APRIL', 'BILL_AMT_MARCH', 'BILL_AMT_FEBRUARY', 'BILL_AMT_JANUARY', 'PAY_AMT_JUNE', 'PAY_AMT_MAY', 'PAY_AMT_APRIL', 'PAY_AMT_MARCH', 'PAY_AMT_FEBRUARY', 'PAY_AMT_JANUARY'], axis=1) 
sub_proba = clf.predict_proba(sub)
sub_proba = sub_proba[:,1]
sub_proba = pd.Series(sub_proba, name='PROBABILITY')
In [37]:
submission = pd.merge(submission_features.copy(), sub_proba, right_index=True, left_index=True)
submission = submission[['ID', 'PROBABILITY']]

# Submission dataset probabilities:
submission
Out[37]:
ID PROBABILITY
0 2774 0.143892
1 15339 0.555450
2 26485 0.439663
3 6657 0.094969
4 7917 0.114265
... ... ...
995 2501 0.103239
996 676 0.106378
997 26934 0.405543
998 27956 0.647503
999 6298 0.174278

1000 rows × 2 columns

In [38]:
# This creates the csv used in submission
#submission.to_csv('Submissions/FordArturo_C1_submission.csv', index=False)

Question 2¶

(1 points)

In [39]:
submission = submission.sort_values('PROBABILITY', axis=0)

# Customers with the lowest probability of defaulting:
submission.head(5)
Out[39]:
ID PROBABILITY
584 3237 0.087383
816 21358 0.087383
383 18523 0.087383
565 15141 0.087383
691 24668 0.091440

Question 3¶

(2.5 points)

In [40]:
# With the following code we calculate the optimal probability cutoff using the ROC calculations.

false_pos_rate, true_pos_rate, probab = roc_curve(test_y, proba)
optimal_proba_cutoff = sorted(list(zip(np.abs(true_pos_rate - false_pos_rate), probab)), key=lambda i: i[0], reverse=True)[0][1]
roc_predictions = [1 if i >= optimal_proba_cutoff else 0 for i in proba]
optimal_proba_cutoff
Out[40]:
0.25904195865470914
In [41]:
# Some accuracy statistics for evaluating different models

print("Accuracy Score Before and After Thresholding: {}, {}".format(accuracy_score(test_y, pred), accuracy_score(test_y, roc_predictions)))
print("Precision Score Before and After Thresholding: {}, {}".format(precision_score(test_y, pred), precision_score(test_y, roc_predictions)))
print("Recall Score Before and After Thresholding: {}, {}".format(recall_score(test_y, pred), recall_score(test_y, roc_predictions)))
print("F1 Score Before and After Thresholding: {}, {}".format(f1_score(test_y, pred), f1_score(test_y, roc_predictions)))
Accuracy Score Before and After Thresholding: 0.8095555555555556, 0.7406666666666667
Precision Score Before and After Thresholding: 0.6403596403596403, 0.44240570846075433
Recall Score Before and After Thresholding: 0.3213032581453634, 0.6526315789473685
F1 Score Before and After Thresholding: 0.4279038718291055, 0.5273390036452006
In [42]:
# Confusion matrix used to check the accuracy of the models attempted.

y_actual = pd.Series(train_y, name='Actual')
y_actual = y_actual.reset_index(drop=True)
y_predict_tf = pd.Series(roc_predictions, name='Predicted')
df_confusion = pd.crosstab(y_actual, y_predict_tf, rownames=['Actual'], colnames=['Predicted'], margins=True)
print (df_confusion)
Predicted     0     1   All
Actual                     
0          4736  2275  7011
1          1321   668  1989
All        6057  2943  9000